Weaving web data into a diachronic corpus patchwork
Authors
Abstract
This paper offers a reassessment of the role of web data in diachronic linguistic analysis. We introduce the diachronic search facilities provided by the WebCorp Linguist’s Search Engine, including a new ‘heat map’ graph for analysing changes in collocational patterns over time. We illustrate how web data can be used to supplement data from standard corpora in lexicological studies. Our focus is on the vogue phrase ‘credit crunch’, and the paper compares examples from standard corpora (BNC, Brown, LOB, Frown, FLOB) with those found in web-accessible newspaper texts. Contrary to previous studies, we do not rely on the web solely for the most up-to-date usage examples. Instead, we show how web-accessible texts dating back to the beginning of the 20th century can be used to fill gaps in, and sharpen, the picture provided by standard corpora.
Similar resources
Gearing the Discursive Practice to the Evolution of Discipline: Diachronic Corpus Analysis of Stance Markers in Research Articles’ Methodology Section
Despite widespread interest and research among applied linguists to explore metadiscourse use, very little is known of how metadiscourse resources have evolved over time in response to the historically developing practices of academic communities. Motivated by such an ambition, the current research drew on a corpus of 874315 words taken from three leading journals of applied linguistics in orde...
Weaving the Web(VTT) of Data
Video has become a first-class citizen on the Web, with broad support in all common Web browsers. Whereas structured mark-up on webpages has made the vision of the Web of Data a reality, in this paper we propose a new vision that we name the Web(VTT) of Data, along with concrete steps to realize this vision. It is based on the evolving standards WebVTT for adding timed text tracks to...
Multiple Tokenizations in a Diachronic Corpus
This paper deals with the construction of a maximally flexible corpus architecture for building and analyzing diachronic corpora. Historical data poses many challenges with regard to representation and analysis, and diachronic corpora are even more varied and unsystematic (Claridge, 2008). Since historical and diachronic corpora are so difficult and expensive to build, it is crucial that they b...
USAAR-CHRONOS: Crawling the Web for Temporal Annotations
This paper describes the USAAR-CHRONOS participation in the Diachronic Text Evaluation task of SemEval-2015 to identify the time period of historical text snippets. We adapt a web crawler to retrieve the original source of the text snippets and determine the publication year of the retrieved texts from their URLs. We report a precision score of >90% in identifying the text epoch. Additionally, ...
A fully data-driven method to identify (correlated) changes in diachronic corpora
In this paper, a method for measuring synchronic corpus (dis-)similarity put forward by Kilgarriff (2001) is adapted and extended to identify trends and correlated changes in diachronic text data, using the Corpus of Historical American English (Davies 2010a) and the Google Ngram Corpora (Michel et al. 2010a). This paper shows that this fully data-driven method, which extracts word types that h...